Stock market prediction
The objective of this project is to predict stock returns from panel-type data.
Importing libraries
Let us load the packages used in this notebook.
library(tidyverse)
library(dplyr) # for data manipulation
library(tidyr) # to tidy data
library(ggplot2) # for the plots
library(lightgbm) # for the LightGBM model
library(corrplot) # for correlation plots
library(reshape2) # for reshaping data
library(forecast) # for time series utilities
library(plotly) # for interactive plots
Functions used in this notebook
This section defines the helper functions used later in the notebook.
drop_ghg_columns <- function(data) {
  data <- data[, !colnames(data) %in% c("ghg_s1", "ghg_s2", "ghg_s3", "year_month")]
  return(data)
}
z_score_normalization <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE) # na.rm added since columns contain NAs
}
Data wrangling
Let us load the dataset that comes in a RData format.
load("data/stocks_clean.RData") # Loading the dataset
df <- stocks_clean # Assigning the dataset to a variable df
rm(stocks_clean) # now that df is assigned, remove stocks_clean to save memory
Dataset description
Let us have a first look at the dataset df.
dim(df) # checking the dimensions of the dataset
[1] 289271 13
head(df) # Viewing the first few observations
| ticker | date | price | market_cap | price_to_book | debt_to_equity | profitability | volatility | revenue | ghg_s1 | ghg_s2 | ghg_s3 | return |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAON US Equity | 1995-12-31 | 0.5048 | 35.1440 | 2.5948 | 85.9073 | 0.8628 | 71.728 | 14.720 | NA | NA | NA | -0.0980883 |
| AAON US Equity | 1996-01-31 | 0.4719 | 32.8520 | 2.4256 | 85.9073 | 3.0722 | 63.087 | 67.346 | NA | NA | NA | -0.0651743 |
| AAON US Equity | 1996-02-29 | 0.5048 | 35.1440 | 2.5948 | 85.9073 | 3.0722 | 97.639 | 67.346 | NA | NA | NA | 0.0697182 |
| AAON US Equity | 1996-03-31 | 0.4170 | 29.0367 | 2.0805 | 65.1878 | 3.1180 | 100.450 | 13.438 | NA | NA | NA | -0.1739303 |
| AAON US Equity | 1996-04-30 | 0.3841 | 26.7444 | 1.9162 | 65.1878 | 3.1180 | 76.133 | 13.438 | NA | NA | NA | -0.0788969 |
| AAON US Equity | 1996-05-31 | 0.3951 | 27.5445 | 1.9710 | 65.1878 | 3.1180 | 88.304 | 13.438 | NA | NA | NA | 0.0286384 |
tail(df) # Viewing the last few observations
| ticker | date | price | market_cap | price_to_book | debt_to_equity | profitability | volatility | revenue | ghg_s1 | ghg_s2 | ghg_s3 | return |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ZIXI US Equity | 2022-07-29 | 8.485 | 481.8671 | 14.1059 | 146.8623 | -3.7101 | 4.118 | 64.85 | NA | NA | NA | 0 |
| ZIXI US Equity | 2022-08-31 | 8.485 | 481.8671 | 14.1059 | 146.8623 | -3.7101 | 4.118 | 64.85 | NA | NA | NA | 0 |
| ZIXI US Equity | 2022-09-30 | 8.485 | 481.8671 | 14.1059 | NA | NA | 4.118 | NA | NA | NA | NA | 0 |
| ZIXI US Equity | 2022-10-31 | 8.485 | 481.8671 | 14.1059 | NA | NA | 4.118 | NA | NA | NA | NA | 0 |
| ZIXI US Equity | 2022-11-30 | 8.485 | 481.8671 | 14.1059 | NA | NA | 4.118 | NA | NA | NA | NA | 0 |
| ZIXI US Equity | 2022-12-30 | 8.485 | 481.8671 | 14.1059 | NA | NA | 4.118 | NA | NA | NA | NA | 0 |
sapply(df, class) # Checking the class of each column
        ticker           date          price     market_cap  price_to_book
"character" "Date" "numeric" "numeric" "numeric"
debt_to_equity profitability volatility revenue ghg_s1
"numeric" "numeric" "numeric" "numeric" "numeric"
ghg_s2 ghg_s3 return
"numeric" "numeric" "numeric"
Variables definition
- ticker (char): Unique identifier for each company listed on the stock exchange, represented by a series of letters.
- date (date): Date of the recorded value.
- price (numeric, dbl): Stock price on the given date.
- market_cap (numeric, dbl): Total market value of a company’s outstanding shares of stock.
- price_to_book (numeric, dbl): Ratio comparing a company’s market value to its book value, providing insight into how the market values the company versus its actual net asset value.
- debt_to_equity (numeric, dbl): Measures a company’s financial leverage by comparing its total liabilities to shareholders’ equity.
- profitability (numeric, dbl): A measure of a company’s ability to generate earnings relative to its revenue, assets, equity, or other financial metrics.
- volatility (numeric, dbl): The stock’s price volatility, a measure of the dispersion of returns for a given stock.
- revenue (numeric, dbl): Total income generated by the company from its business activities before any expenses are deducted.
- ghg_s1, ghg_s2, ghg_s3 (numeric, dbl): Greenhouse gas emissions, categorized by scope. Scope 1 covers direct emissions from owned or controlled sources, scope 2 covers indirect emissions from the generation of purchased electricity, and scope 3 includes all other indirect emissions that occur in a company’s value chain.
- return (numeric, dbl): The stock’s return over the period. This is the target variable we need to predict.
# Calculating number of unique companies/tickers
print(paste("Number of unique companies:", length(unique(df$ticker))))
[1] "Number of unique companies: 885"
# Calculating duration
min_date <- min(df$date)
max_date <- max(df$date)
duration_months <- round(as.numeric(max_date - min_date) / 30.44) # average days per month
print(paste("Total duration from", min_date, "to", max_date, "that is approximately", duration_months, "months"))
[1] "Total duration from 1995-12-29 to 2023-03-31 that is approximately 327 months"
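The month count can also be derived directly from the calendar fields, with no assumption about average month length. A small sketch (months_between is not part of the original notebook):

```r
# Number of whole months between two dates, from the year and month fields
months_between <- function(start, end) {
  s <- as.POSIXlt(start)
  e <- as.POSIXlt(end)
  12 * (e$year - s$year) + (e$mon - s$mon)
}

months_between(as.Date("1995-12-29"), as.Date("2023-03-31"))  # 327
```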
# Checking if time series is complete
df$year_month <- format(df$date, "%Y-%m")
min_date <- min(df$date)
max_date <- max(df$date)
full_sequence <- seq(from = as.Date(format(min_date, "%Y-%m-01")),
to = as.Date(format(max_date, "%Y-%m-01")),
by = "month")
full_year_month <- format(full_sequence, "%Y-%m")
missing_months <- setdiff(full_year_month, df$year_month)
# Used chatgpt for this code
length(missing_months)
[1] 0
The time series is complete.
The dataset contains monthly observations of financial data for 885 US firms (tickers) from December 1995 to March 2023, i.e. about 27 years (~327 months), with ~289k rows and 13 columns.
Checking whether any ticker is absent from the last two years of data, as that will matter when splitting.
x <- max_date - lubridate::years(2)
df |>
group_by(ticker) |>
summarise(last_entry = max(date, na.rm = TRUE)) |>
filter(last_entry <= x) |>
pull(ticker)
[1] "CR US Equity"
# Removing all rows with "CR US Equity" ticker
df <- df |>
filter(ticker != "CR US Equity")
Structure and statistics of the dataset
Structure
str(df[,1:13])
tibble [289,244 × 13] (S3: tbl_df/tbl/data.frame)
$ ticker : chr [1:289244] "AAON US Equity" "AAON US Equity" "AAON US Equity" "AAON US Equity" ...
$ date : Date[1:289244], format: "1995-12-31" "1996-01-31" ...
$ price : num [1:289244] 0.505 0.472 0.505 0.417 0.384 ...
$ market_cap : num [1:289244] 35.1 32.9 35.1 29 26.7 ...
$ price_to_book : num [1:289244] 2.59 2.43 2.59 2.08 1.92 ...
$ debt_to_equity: num [1:289244] 85.9 85.9 85.9 65.2 65.2 ...
$ profitability : num [1:289244] 0.863 3.072 3.072 3.118 3.118 ...
$ volatility : num [1:289244] 71.7 63.1 97.6 100.5 76.1 ...
$ revenue : num [1:289244] 14.7 67.3 67.3 13.4 13.4 ...
$ ghg_s1 : num [1:289244] NA NA NA NA NA NA NA NA NA NA ...
$ ghg_s2 : num [1:289244] NA NA NA NA NA NA NA NA NA NA ...
$ ghg_s3 : num [1:289244] NA NA NA NA NA NA NA NA NA NA ...
$ return : num [1:289244] -0.0981 -0.0652 0.0697 -0.1739 -0.0789 ...
The dataset has one character column (ticker), one date column, and numeric columns for everything else. Since forecasting will be company dependent, ticker is an important identifier.
Statistics
statistics <- summary(df)
statistics
     ticker               date                price            market_cap
Length:289244 Min. :1995-12-29 Min. : 0.00 Min. : 0.0
Class :character 1st Qu.:2002-10-31 1st Qu.: 12.37 1st Qu.: 427.9
Mode :character Median :2009-08-31 Median : 24.60 Median : 1783.9
Mean :2009-08-16 Mean : 47.53 Mean : 14358.9
3rd Qu.:2016-06-30 3rd Qu.: 45.85 3rd Qu.: 8226.0
Max. :2023-03-31 Max. :5908.87 Max. :2913283.9
NA's :3973
price_to_book debt_to_equity profitability volatility
Min. :0.00e+00 Min. : 0.0 Min. :-3300700 Min. : 0.00
1st Qu.:1.32e+00 1st Qu.: 18.6 1st Qu.: 3 1st Qu.: 21.70
Median :2.04e+00 Median : 52.9 Median : 7 Median : 30.71
Mean :6.92e+00 Mean : 203.0 Mean : -238 Mean : 39.13
3rd Qu.:3.42e+00 3rd Qu.: 107.6 3rd Qu.: 15 3rd Qu.: 45.38
Max. :1.24e+05 Max. :1188136.9 Max. : 370300 Max. :5137.32
NA's :20126 NA's :10061 NA's :6075 NA's :160
revenue ghg_s1 ghg_s2 ghg_s3
Min. : -27966 Min. : 0.00 Min. : 0.00 Min. : 0.0
1st Qu.: 100 1st Qu.: 31.29 1st Qu.: 66.94 1st Qu.: 54.9
Median : 444 Median : 212.20 Median : 279.61 Median : 549.5
Mean : 19534 Mean : 5209.68 Mean : 1138.86 Mean : 30353.2
3rd Qu.: 2089 3rd Qu.: 2140.41 3rd Qu.: 1000.00 3rd Qu.: 9130.0
Max. :31379507 Max. :145500.00 Max. :29000.00 Max. :1169970.0
NA's :5078 NA's :251669 NA's :253527 NA's :264840
return year_month
Min. :-0.999900 Length:289244
1st Qu.:-0.046082 Class :character
Median : 0.007174 Mode :character
Mean : 0.011709
3rd Qu.: 0.061675
Max. : 4.140351
Points to note:
- Outliers are present in several variables, as the maxima are far above the upper quartile values.
- Most columns have missing values, with a particularly high share in the ghg columns.
- Negative values in profitability and revenue need to be investigated as they could be anomalies.
- Normalization may be required given the wide range of values across columns.
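To make the normalization point concrete, here is a minimal sketch of per-ticker z-scoring with dplyr on toy data. The helper mirrors the z_score_normalization function defined earlier (with na.rm added, since the real columns contain NAs); the toy column names are illustrative only:

```r
library(dplyr)

z_score <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Toy panel mimicking the structure of df: two tickers on very different scales
toy <- tibble(
  ticker = rep(c("A", "B"), each = 3),
  price  = c(10, 12, 14, 100, 110, 120)
)

toy_scaled <- toy |>
  group_by(ticker) |>
  mutate(price_z = z_score(price)) |>
  ungroup()

toy_scaled$price_z  # both tickers map to -1, 0, 1 despite different scales
```

Grouping by ticker before scaling is what makes the normalization panel-aware: each company is centered and scaled against its own history.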
cat("Percentage of negative revenue values:", (sum(df$revenue < 0, na.rm = TRUE) / nrow(df)) * 100, "%\n")
Percentage of negative revenue values: 0.04529048 %
cat("Number of tickers with negative revenue:", length(unique(df[df$revenue < 0, ]$ticker)), "\n")
Number of tickers with negative revenue: 22
negative_revenue <- aggregate(revenue ~ ticker, data = df, function(x) mean(x < 0) * 100)
# Checking the percentage of values that are negative for each ticker
negative_revenue[negative_revenue$revenue > 0, ]
|  | ticker | revenue |
|---|---|---|
| 22 | AIG US Equity | 0.3048780 |
| 96 | BERK US Equity | 0.4184100 |
| 104 | BK US Equity | 0.9146341 |
| 107 | BLX US Equity | 0.9146341 |
| 116 | BPT US Equity | 1.2195122 |
| 165 | CINF US Equity | 0.9146341 |
| 178 | CMS US Equity | 0.3048780 |
| 205 | CSWC US Equity | 14.8606811 |
| 220 | CVU US Equity | 0.9146341 |
| 300 | FMCC US Equity | 1.2195122 |
| 302 | FNMA US Equity | 0.3048780 |
| 383 | IEP US Equity | 0.9146341 |
| 464 | MBI US Equity | 8.8414634 |
| 504 | MSB US Equity | 0.9174312 |
| 566 | OFG US Equity | 0.3048780 |
| 643 | RCL US Equity | 0.9146341 |
| 644 | RDN US Equity | 2.4390244 |
| 648 | RES US Equity | 0.3048780 |
| 725 | STRS US Equity | 0.3048780 |
| 865 | WTM US Equity | 2.1341463 |
| 876 | Y US Equity | 0.9146341 |
aue_df <- df[df$ticker == "AIG US Equity", ]
ggplot(aue_df, aes(x = date, y = revenue)) +
  geom_line(linewidth = 0.2, color = "blue") +
  labs(title = "Revenue Over Time for AIG US Equity",
       x = "Date",
       y = "Revenue") +
  theme_minimal()
aue_df$date[aue_df$revenue < 0]
[1] "2008-12-31"
Apparently the revenue dropped this low around the time of AIG’s troubles in the 2008 financial crisis. Checking the same for another company, which has about 14% negative revenue values.
cue_df <- df[df$ticker == "CSWC US Equity", ]
ggplot(cue_df, aes(x = date, y = revenue)) +
  geom_line(linewidth = 0.2, color = "blue") +
  labs(title = "Revenue Over Time for CSWC US Equity",
       x = "Date",
       y = "Revenue") +
  theme_minimal()
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_line()`).
cue_df$date[cue_df$revenue < 0]
 [1] NA           NA           NA           NA           NA
[6] "1998-09-30" "1998-10-30" "1998-11-30" "1999-03-31" "1999-04-30"
[11] "1999-05-31" "1999-09-30" "1999-10-29" "1999-11-30" "2000-03-31"
[16] "2000-04-28" "2000-05-31" "2000-09-30" "2000-10-31" "2000-11-30"
[21] "2000-12-31" "2001-01-31" "2001-02-28" "2001-04-30" "2001-05-31"
[26] "2001-09-30" "2001-10-31" "2001-11-30" "2002-06-30" "2002-07-31"
[31] "2002-08-30" "2002-09-30" "2002-10-31" "2002-11-29" "2003-04-30"
[36] "2003-05-30" "2004-06-30" "2004-07-30" "2004-08-31" "2004-09-30"
[41] "2004-10-29" "2004-11-30" "2006-06-30" "2006-07-31" "2006-08-31"
[46] "2007-09-30" "2007-10-31" "2007-11-30" "2007-12-31" "2008-01-31"
[51] "2008-02-29" "2008-04-30" "2008-05-30"
These negative values are spread across time.
For the companies where negative values make up less than 2% of observations, I will replace them with 0; the remaining two companies, “MBI US Equity” and “CSWC US Equity”, will be dropped from the dataset.
# Dropping two companies
df <- df |>
  filter(ticker != "MBI US Equity" & ticker != "CSWC US Equity")
# Replacing remaining negative revenue values with 0
df <- df |>
  mutate(revenue = ifelse(revenue < 0, 0, revenue))
Missing values and duplicates
Missing values
# Calculating the null value percentage in each column
round(colSums(is.na(df)) / nrow(df) * 100, 2)
        ticker           date          price     market_cap  price_to_book
0.00 0.00 0.00 1.37 6.97
debt_to_equity profitability volatility revenue ghg_s1
3.48 2.10 0.06 1.76 86.98
ghg_s2 ghg_s3 return year_month
87.62 91.54 0.00 0.00
# Calculating null values in ghg columns post 2011
subset_df <- df |>
subset(date > as.Date("2011-12-31"))
round(colSums(is.na(subset_df)) / nrow(subset_df) * 100, 2)
        ticker           date          price     market_cap  price_to_book
0.00 0.00 0.00 0.59 7.22
debt_to_equity profitability volatility revenue ghg_s1
3.80 1.87 0.06 1.47 73.27
ghg_s2 ghg_s3 return year_month
74.66 81.93 0.00 0.00
There are a few null values in the market_cap, price_to_book, debt_to_equity, profitability, volatility and revenue columns, but all ghg columns are more than 85% null. Even restricting to observations after 2011, when GHG reporting started, each ghg column is still 73-82% null.
# Used gpt for this
# Step 1 & 2: Check for non-missing values and summarize
ticker_summary <- subset_df %>% #subset_df is df after 2011
group_by(ticker) %>%
summarise(
has_ghg_s1 = any(!is.na(ghg_s1)),
has_ghg_s2 = any(!is.na(ghg_s2)),
has_ghg_s3 = any(!is.na(ghg_s3))
)
# Step 3: Calculate the percentage of tickers with non-missing values
percentage_ghg_s1 = sum(ticker_summary$has_ghg_s1) / nrow(ticker_summary) * 100
percentage_ghg_s2 = sum(ticker_summary$has_ghg_s2) / nrow(ticker_summary) * 100
percentage_ghg_s3 = sum(ticker_summary$has_ghg_s3) / nrow(ticker_summary) * 100
# Print the results
print(paste("Percentage of tickers with at least one non-missing GHG S1 value:", percentage_ghg_s1))
[1] "Percentage of tickers with at least one non-missing GHG S1 value: 54.3083900226757"
print(paste("Percentage of tickers with at least one non-missing GHG S2 value:", percentage_ghg_s2))
[1] "Percentage of tickers with at least one non-missing GHG S2 value: 53.4013605442177"
print(paste("Percentage of tickers with at least one non-missing GHG S3 value:", percentage_ghg_s3))
[1] "Percentage of tickers with at least one non-missing GHG S3 value: 39.9092970521542"
I checked whether at least one GHG value per scope is present for each company, with the idea of imputing that company’s missing values with its reported figure. This is not a sound approach, since greenhouse gas emissions change every year and depend on many scope-specific factors, and it would not work anyway: roughly half of the companies have no value at all for scopes 1 and 2, and about 60% have none for scope 3.
Unable to find a workable solution for ghg_s1, ghg_s2 and ghg_s3, I have decided to drop these columns for the purpose of this project.
# dropping ghg columns
df_cleaner <- drop_ghg_columns(df)
head(df_cleaner)
| ticker | date | price | market_cap | price_to_book | debt_to_equity | profitability | volatility | revenue | return |
|---|---|---|---|---|---|---|---|---|---|
| AAON US Equity | 1995-12-31 | 0.5048 | 35.1440 | 2.5948 | 85.9073 | 0.8628 | 71.728 | 14.720 | -0.0980883 |
| AAON US Equity | 1996-01-31 | 0.4719 | 32.8520 | 2.4256 | 85.9073 | 3.0722 | 63.087 | 67.346 | -0.0651743 |
| AAON US Equity | 1996-02-29 | 0.5048 | 35.1440 | 2.5948 | 85.9073 | 3.0722 | 97.639 | 67.346 | 0.0697182 |
| AAON US Equity | 1996-03-31 | 0.4170 | 29.0367 | 2.0805 | 65.1878 | 3.1180 | 100.450 | 13.438 | -0.1739303 |
| AAON US Equity | 1996-04-30 | 0.3841 | 26.7444 | 1.9162 | 65.1878 | 3.1180 | 76.133 | 13.438 | -0.0788969 |
| AAON US Equity | 1996-05-31 | 0.3951 | 27.5445 | 1.9710 | 65.1878 | 3.1180 | 88.304 | 13.438 | 0.0286384 |
# Checking if null values are specific to a few tickers
tickers_always_null <- df_cleaner |>
group_by(ticker) |>
summarise(
all_na_market_cap = all(is.na(market_cap)),
all_na_price_to_book = all(is.na(price_to_book)),
all_na_debt_to_equity = all(is.na(debt_to_equity)),
all_na_profitability = all(is.na(profitability)),
all_na_volatility = all(is.na(volatility)),
all_na_revenue = all(is.na(revenue))
) |>
filter(
all_na_market_cap |
all_na_price_to_book |
all_na_debt_to_equity |
all_na_profitability |
all_na_volatility |
all_na_revenue
) |>
pull(ticker)
# tickers_always_null now contains tickers with always null values in specified columns
length(tickers_always_null)
[1] 40
There are 40 tickers for which at least one of these columns is entirely missing, so there is no within-ticker information to impute from. I could impute those values with 0, but any arbitrary value would add bias, so I will drop these tickers instead.
tickers_always_null
 [1] "ADX US Equity"  "AEM US Equity"  "ASA US Equity"  "AU US Equity"
[5] "AZN US Equity" "BBVA US Equity" "BCE US Equity" "BCS US Equity"
[9] "BHP US Equity" "BP US Equity" "CCU US Equity" "CET US Equity"
[13] "ENB US Equity" "GAM US Equity" "GFI US Equity" "GILT US Equity"
[17] "GSK US Equity" "HFC US Equity" "HMC US Equity" "IMO US Equity"
[21] "KOF US Equity" "NVO US Equity" "PEO US Equity" "PHG US Equity"
[25] "PHI US Equity" "RIO US Equity" "SAN US Equity" "SQM US Equity"
[29] "SSL US Equity" "TEF US Equity" "TEVA US Equity" "TGA US Equity"
[33] "TGB US Equity" "TM US Equity" "TRP US Equity" "TRXC US Equity"
[37] "UL US Equity" "WBK US Equity" "WPP US Equity" "YPF US Equity"
Dropping the data for these 40 tickers, which is about 4.5% of the tickers.
df_cleanest <- df_cleaner |>
filter(!(ticker %in% tickers_always_null))
round(colSums(is.na(df_cleanest)) / nrow(df_cleanest) * 100, 4)
        ticker           date          price     market_cap  price_to_book
0.0000 0.0000 0.0000 0.3517 2.7152
debt_to_equity profitability volatility revenue return
2.8824 1.4113 0.0508 1.0650 0.0000
There are still some missing values present in market_cap, price_to_book, debt_to_equity, profitability, volatility and revenue columns.
For these values, I will fill each ticker’s gaps with the nearest available value in that column (upward first, then downward).
df_cleanmax <- df_cleanest |>
group_by(ticker) |>
fill(everything(), .direction = "updown")
round(colSums(is.na(df_cleanmax)) / nrow(df_cleanmax) * 100, 4)
        ticker           date          price     market_cap  price_to_book
0 0 0 0 0
debt_to_equity profitability volatility revenue return
0 0 0 0 0
There are no missing values in our df_cleanmax dataset.
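To make the fill direction concrete, here is a toy example of tidyr::fill() with .direction = "updown" (fill upward first, then downward), the setting the chunk above relies on:

```r
library(dplyr)
library(tidyr)

# One ticker with gaps at the start and in the middle
toy <- tibble(ticker = rep("A", 4), v = c(NA, 2, NA, 4))

filled <- toy |>
  group_by(ticker) |>
  fill(v, .direction = "updown") |>
  ungroup()

filled$v  # c(2, 2, 4, 4): each NA takes the next non-missing value below it
```

The "updown" direction handles the common panel case where the very first observations of a ticker are missing (a pure downward fill would leave those as NA).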
Duplicates
Checking for duplicates.
duplicates <- df_cleanmax[duplicated(df_cleanmax), ]
nrow(duplicates) # note: length() on a data frame counts its columns, so nrow() is needed here
Any fully duplicated rows are removed with distinct().
df_cleanmaxpro <- distinct(df_cleanmax)
Exploratory data analysis
stocks <- df_cleanmaxpro
Univariate analysis
stocks |>
  ggplot(aes(x = price)) +
  geom_histogram(binwidth = 10, fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +
  geom_density(aes(y = after_stat(count)), color = "#404080", linewidth = 1) +
  labs(title = "Histogram of Price",
       x = "Price") +
  theme_light()
stocks |>
  ggplot(aes(x = price)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +
  labs(title = "Boxplot of Price",
       x = "Price") +
  theme_light() +
  theme(axis.text.x = element_blank())
# Function to detect outliers
detect_outliers <- function(data, variable, threshold = 1.5) {
  q1 <- quantile(data[[variable]], 0.25, na.rm = TRUE)
  q3 <- quantile(data[[variable]], 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  lower_bound <- q1 - threshold * iqr
  upper_bound <- q3 + threshold * iqr
  outliers <- data[data[[variable]] < lower_bound | data[[variable]] > upper_bound, "ticker"]
  return(outliers)
}
# Detect outliers for the price
outliers <- detect_outliers(stocks, "price")
table(outliers)
ticker
AAPL US Equity ABMD US Equity ABT US Equity ADBE US Equity ADI US Equity
33 84 32 82 52
ADM US Equity ADP US Equity ADSK US Equity AEP US Equity AFG US Equity
1 76 71 6 57
AGCO US Equity AIG US Equity AIN US Equity AIT US Equity AJG US Equity
27 153 4 17 36
ALG US Equity ALK US Equity ALL US Equity ALX US Equity AMAT US Equity
55 1 47 235 22
AMGN US Equity AMSC US Equity AMWD US Equity AN US Equity ANAT US Equity
121 145 13 23 167
AON US Equity APA US Equity APD US Equity ARCB US Equity ARW US Equity
89 35 116 2 27
ASGN US Equity ASH US Equity ATO US Equity ATR US Equity ATRI US Equity
13 14 41 55 183
ATVI US Equity AVY US Equity AWR US Equity AXP US Equity AXR US Equity
1 64 2 57 2
AZO US Equity BA US Equity BANF US Equity BBSI US Equity BBY US Equity
202 123 3 2 19
BC US Equity BCPC US Equity BDX US Equity BH US Equity BIIB US Equity
5 47 116 315 139
BIO US Equity BMI US Equity BOKF US Equity BPOP US Equity BPT US Equity
150 17 16 130 27
BXMT US Equity C US Equity CACC US Equity CACI US Equity CAR US Equity
141 152 126 83 19
CASY US Equity CAT US Equity CB US Equity CBB US Equity CBRL US Equity
91 93 112 29 110
CCF US Equity CCK US Equity CDE US Equity CDNS US Equity CDR US Equity
48 16 26 33 58
CFR US Equity CHCO US Equity CHD US Equity CHDN US Equity CHE US Equity
43 3 5 50 105
CI US Equity CINF US Equity CLDX US Equity CLF US Equity CLH US Equity
102 33 202 5 16
CLX US Equity CMA US Equity CMI US Equity CMO US Equity CNBKA US Equity
102 2 136 7 21
CNMD US Equity CNR US Equity COKE US Equity COO US Equity COP US Equity
26 86 99 123 11
COST US Equity CPF US Equity CPK US Equity CPT US Equity CRK US Equity
128 161 30 42 95
CRMT US Equity CRUS US Equity CRVL US Equity CSL US Equity CTAS US Equity
23 2 28 86 82
CUZ US Equity CVM US Equity CVS US Equity CVX US Equity CW US Equity
2 171 19 124 66
DD US Equity DDS US Equity DE US Equity DHI US Equity DHR US Equity
8 41 78 5 63
DIN US Equity DIOD US Equity DIS US Equity DORM US Equity DOV US Equity
9 2 88 13 42
DTE US Equity DUK US Equity DVN US Equity DX US Equity DY US Equity
48 20 6 34 10
EA US Equity ECL US Equity ED US Equity EFX US Equity EGP US Equity
61 115 4 93 53
EMR US Equity EOG US Equity ESE US Equity ETN US Equity ETR US Equity
3 42 6 32 60
EXPD US Equity EXPO US Equity FBP US Equity FCFS US Equity FCNCA US Equity
24 14 124 5 267
FDX US Equity FICO US Equity FISV US Equity FMC US Equity FOSL US Equity
148 86 40 34 30
FRT US Equity GD US Equity GE US Equity GILD US Equity GLW US Equity
119 111 234 16 2
GPC US Equity GWW US Equity HAE US Equity HAS US Equity HD US Equity
51 160 25 23 102
HEI US Equity HELE US Equity HES US Equity HON US Equity HOV US Equity
45 65 23 93 158
HP US Equity HRC US Equity HSY US Equity HUM US Equity IBCP US Equity
8 42 89 113 95
IBM US Equity ICUI US Equity IDA US Equity IDXX US Equity IEP US Equity
213 91 42 80 23
IEX US Equity IFF US Equity IMCI US Equity IMKTA US Equity INTU US Equity
72 104 76 1 92
IPAR US Equity ITIC US Equity ITT US Equity ITW US Equity JACK US Equity
5 81 2 87 21
JBHT US Equity JBSS US Equity JJSF US Equity JKHY US Equity JNJ US Equity
64 2 102 71 107
JOUT US Equity JPM US Equity KAI US Equity KEX US Equity KLAC US Equity
13 61 48 11 67
KMB US Equity KSU US Equity KWR US Equity LANC US Equity LEE US Equity
115 92 80 92 148
LEN US Equity LFUS US Equity LLY US Equity LNN US Equity LOW US Equity
11 95 59 39 52
LRCX US Equity LSTR US Equity MAN US Equity MATX US Equity MCD US Equity
77 64 23 3 108
MDT US Equity MGA US Equity MGPI US Equity MGRC US Equity MHK US Equity
35 1 9 4 118
MIDD US Equity MKC US Equity MKL US Equity MLAB US Equity MMC US Equity
92 5 315 91 45
MMM US Equity MOS US Equity MRK US Equity MS US Equity MSA US Equity
124 7 6 6 56
MSEX US Equity MSFT US Equity MSI US Equity MTB US Equity MTH US Equity
9 59 84 163 14
MTRN US Equity MTZ US Equity MXIM US Equity NATH US Equity NAVB US Equity
2 6 20 1 28
NBR US Equity NDSN US Equity NEU US Equity NKE US Equity NOC US Equity
305 80 158 34 114
NPK US Equity NSC US Equity NTRS US Equity NTZ US Equity NUE US Equity
52 90 35 39 22
NVR US Equity NWLI US Equity ODFL US Equity ODP US Equity ORLY US Equity
269 281 55 137 123
OSK US Equity OXM US Equity OXY US Equity PARD US Equity PAYX US Equity
15 8 8 162 25
PEP US Equity PG US Equity PGR US Equity PH US Equity PII US Equity
92 50 20 117 75
PKI US Equity PLXS US Equity PNC US Equity PNW US Equity PPG US Equity
36 5 77 1 99
PRGO US Equity PRK US Equity PSA US Equity PSB US Equity PTC US Equity
52 103 155 84 31
PVH US Equity PZZA US Equity QCOM US Equity QDEL US Equity R US Equity
80 12 33 28 1
RAD US Equity RBC US Equity RCL US Equity REGN US Equity RELV US Equity
126 72 37 134 1
RGA US Equity RGEN US Equity RGLD US Equity RHI US Equity RIG US Equity
75 37 41 10 33
RJF US Equity RLI US Equity ROG US Equity ROP US Equity ROST US Equity
16 28 71 133 33
RPM US Equity SAFM US Equity SANM US Equity SBCF US Equity SBUX US Equity
3 74 32 39 19
SCL US Equity SEB US Equity SFE US Equity SHW US Equity SIG US Equity
37 328 19 77 28
SIGI US Equity SJM US Equity SLB US Equity SMG US Equity SNA US Equity
2 118 14 39 115
SNPS US Equity SNV US Equity SPB US Equity STE US Equity STLY US Equity
52 5 21 59 139
STT US Equity SWK US Equity SWKS US Equity SXI US Equity SXT US Equity
6 86 50 27 2
SYK US Equity TAP US Equity TARO US Equity TECH US Equity TER US Equity
89 9 63 11 24
TFX US Equity TGT US Equity THC US Equity THO US Equity TISI US Equity
111 43 28 30 184
TMO US Equity TPL US Equity TROW US Equity TRV US Equity TTC US Equity
114 113 62 101 16
TTEK US Equity TXN US Equity TYL US Equity UFI US Equity UHS US Equity
30 63 105 13 103
UHT US Equity UIS US Equity UMBF US Equity UNF US Equity UNH US Equity
8 79 6 113 101
UNP US Equity USM US Equity USPH US Equity UTMD US Equity VAR US Equity
92 4 40 12 67
VFC US Equity VHI US Equity VICR US Equity VMC US Equity VMI US Equity
1 43 8 93 144
VRTX US Equity WDC US Equity WDFC US Equity WEC US Equity WHR US Equity
89 10 89 10 131
WIRE US Equity WM US Equity WMT US Equity WRLD US Equity WSM US Equity
18 49 55 46 29
WSO US Equity WST US Equity WTM US Equity WTS US Equity X US Equity
107 62 315 33 17
XLNX US Equity XOM US Equity XRX US Equity Y US Equity ZBRA US Equity
43 13 28 316 74
length(table(outliers))
[1] 360
About 43% of the tickers (360 of roughly 840) have outlier values for price. Since scales differ widely across tickers, I will treat these outliers per ticker per variable.
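As a sketch of what per-ticker, per-variable treatment could look like in general (the notebook below instead handles a few extreme companies case by case), each ticker's values could be winsorized at its own Tukey fences. The winsorize_iqr helper and toy data are illustrative, not part of the original analysis:

```r
library(dplyr)

# Clip a vector at Tukey fences: [q1 - k*IQR, q3 + k*IQR]
winsorize_iqr <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- k * diff(q)
  pmin(pmax(x, q[1] - fence), q[2] + fence)
}

toy <- tibble(
  ticker = rep("A", 6),
  price  = c(10, 11, 12, 13, 14, 1000)  # 1000 is an obvious outlier
)

toy_w <- toy |>
  group_by(ticker) |>
  mutate(price_w = winsorize_iqr(price)) |>
  ungroup()

max(toy_w$price_w)  # 17.5: the outlier is clipped to the ticker's upper fence
```

Winsorizing (clipping) rather than deleting keeps the panel balanced, which matters when each ticker's time series must stay complete.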
stocks |>
  ggplot(aes(x = market_cap)) +
  geom_histogram(binwidth = 10, fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +
  geom_density(aes(y = after_stat(count)), color = "#404080", linewidth = 1) +
  labs(title = "Histogram of market_cap",
       x = "market_cap") +
  theme_light()
stocks |>
  ggplot(aes(x = market_cap)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +
  labs(title = "Boxplot of market_cap",
       x = "market_cap") +
  theme_light() +
  theme(axis.text.x = element_blank())
stocks |>
  ggplot(aes(x = price_to_book)) +
  geom_histogram(binwidth = 10, fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +
  geom_density(aes(y = after_stat(count)), color = "#404080", linewidth = 1) +
  labs(title = "Histogram of price_to_book",
       x = "price_to_book") +
  theme_light()
stocks |>
  ggplot(aes(x = price_to_book)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +
  labs(title = "Boxplot of price_to_book",
       x = "price_to_book") +
  theme_light() +
  theme(axis.text.x = element_blank())
cat("Company with the max price_to_book:", stocks$ticker[which.max(stocks$price_to_book)], "\n")
Company with the max price_to_book: CVM US Equity
# Checking the price_to_book box plot for this company
cvm_stocks <- subset(stocks, ticker == "CVM US Equity")
cvm_stocks |>
  ggplot(aes(x = price_to_book)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.5) +
  labs(title = "Boxplot of price_to_book",
       x = "price_to_book") +
  theme_light() +
  theme(axis.text.x = element_blank())
These extreme values look like outliers, so I will replace them with the median price_to_book for this company.
# Calculating the median price_to_book
median_p <- median(stocks$price_to_book[stocks$ticker == "CVM US Equity"], na.rm = TRUE)
# Calculating the 90th percentile price_to_book
percentile_90 <- quantile(stocks$price_to_book[stocks$ticker == "CVM US Equity"], probs = 0.9, na.rm = TRUE)
# Finding the indices of price_to_book values above the 90th percentile
upper_10_idx <- which(stocks$ticker == "CVM US Equity" & stocks$price_to_book > percentile_90)
# Imputing the upper 10% price_to_book values with the median for CVM US Equity
stocks$price_to_book[upper_10_idx] <- median_p
stocks |>
  ggplot(aes(x = debt_to_equity)) +
  geom_histogram(binwidth = 10, fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +
  geom_density(aes(y = after_stat(count)), color = "#404080", linewidth = 1) +
  labs(title = "Histogram of debt_to_equity",
       x = "debt_to_equity") +
  theme_light()
stocks |>
  ggplot(aes(x = debt_to_equity)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +
  labs(title = "Boxplot of debt_to_equity",
       x = "debt_to_equity") +
  theme_light() +
  theme(axis.text.x = element_blank())
cat("Company with the max debt_to_equity:", stocks$ticker[which.max(stocks$debt_to_equity)], "\n")
Company with the max debt_to_equity: FNMA US Equity
# Checking the debt_to_equity box plot for this company
fue_stocks <- subset(stocks, ticker == "FNMA US Equity")
fue_stocks |>
  ggplot(aes(x = debt_to_equity)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.5) +
  labs(title = "Boxplot of debt_to_equity",
       x = "debt_to_equity") +
  theme_light() +
  theme(axis.text.x = element_blank())
This value looks like an outlier, so I will replace it with the median debt_to_equity for this company.
# Calculating the median debt_to_equity
median_dte <- median(stocks$debt_to_equity[stocks$ticker == "FNMA US Equity"], na.rm = TRUE)
# Finding the index of the maximum debt_to_equity for FNMA US Equity
max_dte_idx <- which(stocks$ticker == "FNMA US Equity" & stocks$debt_to_equity == max(stocks$debt_to_equity[stocks$ticker == "FNMA US Equity"], na.rm = TRUE))
# Imputing the maximum debt_to_equity value with the median
stocks$debt_to_equity[max_dte_idx] <- median_dte
stocks |>
  ggplot(aes(x = debt_to_equity)) +
  geom_histogram(binwidth = 10, fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +
  geom_density(aes(y = after_stat(count)), color = "#404080", linewidth = 1) +
  labs(title = "Histogram of debt_to_equity",
       x = "debt_to_equity") +
  theme_light()
stocks |>
  ggplot(aes(x = debt_to_equity)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +
  labs(title = "Boxplot of debt_to_equity",
       x = "debt_to_equity") +
  theme_light() +
  theme(axis.text.x = element_blank())
The distribution still has many outliers, but it is better than before.
stocks |>
  ggplot(aes(x = profitability)) +
  geom_histogram(binwidth = 10, fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +
  geom_density(aes(y = after_stat(count)), color = "#404080", linewidth = 1) +
  labs(title = "Histogram of profitability",
       x = "profitability") +
  theme_light()
stocks |>
  ggplot(aes(x = profitability)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +
  labs(title = "Boxplot of profitability",
       x = "profitability") +
  theme_light() +
  theme(axis.text.x = element_blank())
cat("Percentage of negative profitability values:", (sum(stocks$profitability < 0) / nrow(stocks)) * 100, "%\n")
Percentage of negative profitability values: 13.72491 %
Given my limited knowledge of the domain, I did some research and confirmed that profitability can indeed be negative for some companies. Checking whether the extreme values are specific to one company.
cat("Company with the minimum profitability:", stocks$ticker[which.min(stocks$profitability)], "\n")
Company with the minimum profitability: GSS US Equity
# Checking the profitability box plot for this company
gue_stocks <- subset(stocks, ticker == "GSS US Equity")
gue_stocks |>
  ggplot(aes(x = profitability)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.5) +
  labs(title = "Boxplot of profitability",
       x = "profitability") +
  theme_light() +
  theme(axis.text.x = element_blank())
summary(gue_stocks)
     ticker               date                price            market_cap
Length:328 Min. :1995-12-31 Min. : 0.8309 Min. : 9.717
Class :character 1st Qu.:2002-10-23 1st Qu.: 3.5119 1st Qu.: 123.003
Mode :character Median :2009-08-15 Median : 5.7125 Median : 324.404
Mean :2009-08-14 Mean :12.0790 Mean : 364.548
3rd Qu.:2016-06-07 3rd Qu.:15.6438 3rd Qu.: 502.740
Max. :2023-03-31 Max. :93.1250 Max. :1347.599
price_to_book debt_to_equity profitability volatility
Min. : 0.3059 Min. : 0.00 Min. :-3300700 Min. : 7.727
1st Qu.: 1.3339 1st Qu.: 15.56 1st Qu.: -36 1st Qu.: 49.261
Median : 2.2096 Median : 29.25 Median : -9 Median : 64.143
Mean : 44.5658 Mean : 215.95 Mean : -126277 Mean : 75.921
3rd Qu.: 7.5202 3rd Qu.: 239.56 3rd Qu.: 4 3rd Qu.: 94.133
Max. :622.1912 Max. :3187.23 Max. : 93 Max. :323.290
revenue return
Min. : 0.00 Min. :-0.494444
1st Qu.: 12.72 1st Qu.:-0.133405
Median : 61.91 Median :-0.004281
Mean : 83.58 Mean : 0.018887
3rd Qu.:103.69 3rd Qu.: 0.117813
Max. :550.54 Max. : 1.694444
Its profitability has some extreme outliers. We will impute the lowest values with the median.
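This clip-to-median pattern will be repeated for several tickers below (GSS, MUX, IMCI), so it could be factored into a small helper. A sketch with a hypothetical function name — the notebook itself keeps the steps inline:

```r
# Hypothetical helper: replace one tail of a ticker's distribution for a
# given column with that ticker's median value.
# side = "lower" clips values below the `prob` quantile,
# side = "upper" clips values above it.
impute_tail_with_median <- function(data, tick, col, prob, side = c("lower", "upper")) {
  side <- match.arg(side)
  in_ticker <- data$ticker == tick
  vals <- data[[col]][in_ticker]
  med <- median(vals, na.rm = TRUE)
  cutoff <- quantile(vals, probs = prob, na.rm = TRUE)
  idx <- if (side == "lower") {
    which(in_ticker & data[[col]] < cutoff)
  } else {
    which(in_ticker & data[[col]] > cutoff)
  }
  data[[col]][idx] <- med
  data
}
```

With this helper, the GSS step below would read `stocks <- impute_tail_with_median(stocks, "GSS US Equity", "profitability", 0.15, "lower")`.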
# Calculating the median profitability for GSS US Equity
median_p <- median(stocks$profitability[stocks$ticker == "GSS US Equity"], na.rm = TRUE)
# Calculating the 15th percentile profitability for GSS US Equity
percentile_15 <- quantile(stocks$profitability[stocks$ticker == "GSS US Equity"], probs = 0.15, na.rm = TRUE)
# Finding the indices of profitability values below the 15th percentile
lower_15_idx <- which(stocks$ticker == "GSS US Equity" & stocks$profitability < percentile_15)
# Imputing the lower 15% profitability values with the median for GSS US Equity
stocks$profitability[lower_15_idx] <- median_p
# Checking the profitability box plot for this company
gue_stocks <- subset(stocks, ticker == "GSS US Equity")
gue_stocks |>
ggplot(aes(x = profitability)) +
geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.5) +
labs(title = "Boxplot of profitability",
x = "profitability") +
theme_light() +
theme(axis.text.x = element_blank())
# Performing the same process for "MUX US Equity"
# Calculating the median profitability for MUX US Equity
median_p <- median(stocks$profitability[stocks$ticker == "MUX US Equity"], na.rm = TRUE)
# Calculating the 10th percentile profitability for MUX US Equity
percentile_10 <- quantile(stocks$profitability[stocks$ticker == "MUX US Equity"], probs = 0.1, na.rm = TRUE)
# Finding the indices of profitability values below the 10th percentile
lower_10_idx <- which(stocks$ticker == "MUX US Equity" & stocks$profitability < percentile_10)
# Imputing the lower 10% profitability values with the median for MUX US Equity
stocks$profitability[lower_10_idx] <- median_p
# Checking the profitability box plot for this company
mue_stocks <- subset(stocks, ticker == "MUX US Equity")
mue_stocks |>
ggplot(aes(x = profitability)) +
geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.5) +
labs(title = "Boxplot of profitability",
x = "profitability") +
theme_light() +
theme(axis.text.x = element_blank())
stocks |>
ggplot(aes(x = volatility)) +
geom_histogram(binwidth = 10, fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +
geom_density(aes(y = after_stat(count)), color = "#404080", linewidth = 1) +
labs(title = "Histogram of volatility",
x = "volatility") +
theme_light()
stocks |>
ggplot(aes(x = volatility)) +
geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +
geom_vline(xintercept = quantile(stocks$volatility, probs = c(0.99991)), color = "red", linetype = "dashed") +
labs(title = "Boxplot of volatility",
x = "volatility") +
theme_light() +
theme(axis.text.x = element_blank())
cat("Company with the max volatility:", stocks$ticker[which.max(stocks$volatility)], "\n")
Company with the max volatility: IMCI US Equity
# Checking the volatility box plot for this company
iue_stocks <- subset(stocks, ticker == "IMCI US Equity")
iue_stocks |>
ggplot(aes(x = volatility)) +
geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.5) +
labs(title = "Boxplot of volatility",
x = "volatility") +
theme_light() +
theme(axis.text.x = element_blank())
# Calculating the median volatility
median_v <- median(stocks$volatility[stocks$ticker == "IMCI US Equity"], na.rm = TRUE)
# Calculating the 90th percentile volatility
percentile_90 <- quantile(stocks$volatility[stocks$ticker == "IMCI US Equity"], probs = 0.9, na.rm = TRUE)
# Finding the indices of volatility values above the 90th percentile
upper_10_idx <- which(stocks$ticker == "IMCI US Equity" & stocks$volatility > percentile_90)
# Imputing the upper 10% volatility values with the median for IMCI US Equity
stocks$volatility[upper_10_idx] <- median_v
stocks |>
ggplot(aes(x = revenue)) +
geom_histogram(binwidth = 10, fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +
geom_density(aes(y = after_stat(count)), color = "#404080", linewidth = 1) +
labs(title = "Histogram of revenue",
x = "revenue") +
theme_light()
stocks |>
ggplot(aes(x = revenue)) +
geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +
labs(title = "Boxplot of revenue",
x = "revenue") +
theme_light() +
theme(axis.text.x = element_blank())
stocks |>
ggplot(aes(x = return)) +
geom_histogram(binwidth = 0.05, fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +
geom_density(aes(y = after_stat(count)), color = "#404080", linewidth = 1) +
labs(title = "Histogram of return",
x = "return") +
theme_light()
stocks |>
ggplot(aes(x = return)) +
geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +
labs(title = "Boxplot of return",
x = "return") +
theme_light() +
theme(axis.text.x = element_blank())
Checking if all tickers have the same frequency in the dataset.
# counting how many tickers belong to different frequencies
barplot(table(table(stocks$ticker)), main = "Frequency of Ticker Values",
xlab = "Ticker", ylab = "Frequency",
col = "skyblue", border = "black")Majority of the tickers have a frequency of 328 whereas 29 tickers do not have the same frequency.
names(table(stocks$ticker)[table(stocks$ticker)==129])[1] "HR US Equity"
summary(subset(stocks, ticker == "HR US Equity")) ticker date price market_cap
Length:129 Min. :2012-07-31 Min. :18.38 Min. :1970
Class :character 1st Qu.:2015-03-31 1st Qu.:24.08 1st Qu.:3151
Mode :character Median :2017-11-30 Median :26.96 Median :5381
Mean :2017-11-29 Mean :26.62 Mean :4902
3rd Qu.:2020-07-31 3rd Qu.:29.42 3rd Qu.:6033
Max. :2023-03-31 Max. :34.05 Max. :9994
price_to_book debt_to_equity profitability volatility
Min. :0.9687 Min. : 58.12 Min. :-25.526 Min. : 9.225
1st Qu.:1.7555 1st Qu.: 82.72 1st Qu.: 4.828 1st Qu.:15.853
Median :1.9354 Median : 89.64 Median : 7.954 Median :19.240
Mean :1.9832 Mean : 91.76 Mean : 9.225 Mean :21.609
3rd Qu.:2.2347 3rd Qu.:100.53 3rd Qu.: 10.893 3rd Qu.:23.957
Max. :2.8958 Max. :115.05 Max. : 98.773 Max. :92.440
revenue return
Min. : 73.47 Min. :-0.220295
1st Qu.:102.05 1st Qu.:-0.035327
Median :172.30 Median : 0.002463
Mean :218.81 Mean : 0.001582
3rd Qu.:191.49 3rd Qu.: 0.045754
Max. :932.64 Max. : 0.125759
For example, “HR US Equity” only enters the sample in 2012. I will keep these shorter histories rather than remove them.
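To double-check the count read off the barplot, the tickers with incomplete histories can also be listed programmatically — a sketch using a hypothetical helper name:

```r
# Hypothetical helper: return the tickers whose number of observations
# differs from the full history length (328 monthly rows in this dataset).
incomplete_tickers <- function(tickers, full_length = 328) {
  freq <- table(tickers)
  names(freq)[freq != full_length]
}
```

Here, `length(incomplete_tickers(stocks$ticker))` should return the 29 tickers mentioned above.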
Multivariate analysis
Correlation analysis
# selecting variables
selected_vars <- c("price", "market_cap", "price_to_book", "debt_to_equity", "profitability", "volatility", "revenue", "return")
# Using Spearman correlation (rather than Pearson) because the data is not normally
# distributed and we cannot assume linear relationships between variables,
# especially as they are ticker-dependent
correlation_matrix <- cor(stocks[, selected_vars], method = "spearman")
# Melting the correlation matrix for visualization
melted_correlation <- melt(correlation_matrix)
# Rounding the correlation coefficients to two decimal places
melted_correlation$value <- round(melted_correlation$value, 2)
# used gpt for the colors :)
my_palette <- c("#f7fbff", "#deebf7", "#c6dbef", "#9ecae1", "#6baed6", "#4292c6", "#2171b5", "#08519c", "#08306b")
ggplot(melted_correlation, aes(Var1, Var2, fill = value, label = value)) +
geom_tile(color = "white") +
geom_text(color = "black") + # Add text labels
scale_fill_gradientn(colors = my_palette, name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.title.x = element_blank(),
axis.title.y = element_blank()) +
coord_fixed()
The explanatory variables show little correlation with the target variable, but several of them are strongly correlated with each other.
Numerical vs categorical
Performing a Kruskal-Wallis test between the categorical variable ticker and the target variable return. Since the data is not normally distributed, we cannot use ANOVA.
kruskal_result <- kruskal.test(return ~ ticker, data = stocks)
# Print the test result
print(kruskal_result)
Kruskal-Wallis rank sum test
data: return by ticker
Kruskal-Wallis chi-squared = 988.51, df = 841, p-value = 0.0003113
The test statistic is 988.51 with 841 degrees of freedom. Since the p-value (0.0003) is below the 0.05 significance level, we reject the null hypothesis.
This indicates that there is evidence to suggest that the median returns differ across the ticker categories.
Hence, there is a statistically significant association between the “return” and “ticker” variables.
Time series analysis
# Aggregating returns by date
agg_returns <- stocks |>
group_by(date) |>
summarise(total_return = sum(return, na.rm = TRUE))
# Plotting aggregate returns over time
gg <- ggplot(agg_returns, aes(date, total_return)) +
geom_line(col = "blue") +
labs(x = "Date", y = "Total Return", title = "Aggregate Returns Over Time")
# Convert ggplot to plotly object
ggplotly(gg, dynamicTicks = TRUE)
The returns fluctuate significantly over the observed period, with sharp increases and decreases indicating a high degree of volatility. There are stretches of both positive and negative aggregate returns.
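Note that summing returns across tickers mixes the changing number of listed stocks into the level of the series. An equal-weighted mean per date is often easier to interpret; a sketch of that alternative aggregation (not the one plotted above):

```r
library(dplyr)

# Equal-weighted "portfolio" return: the mean return across all tickers
# present on each date, instead of the raw sum.
equal_weighted_returns <- function(df) {
  df |>
    group_by(date) |>
    summarise(ew_return = mean(return, na.rm = TRUE), .groups = "drop")
}
```

Called as `equal_weighted_returns(stocks)`, this yields a series whose scale does not depend on how many tickers are listed at each date.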
stocks$date <- as.Date(stocks$date)
# Note: this pools all tickers' returns into a single series, which is a
# simplification for illustration rather than a per-ticker time series
ts_data <- ts(stocks$return, frequency = 1)
arima_model <- auto.arima(ts_data)
summary(arima_model)
Series: ts_data
ARIMA(2,0,2) with non-zero mean
Coefficients:
ar1 ar2 ma1 ma2 mean
-0.0252 -0.8403 0.0062 0.8233 0.0119
s.e. 0.0183 0.0118 0.0194 0.0120 0.0002
sigma^2 = 0.01557: log likelihood = 182601.3
AIC=-365190.6 AICc=-365190.6 BIC=-365127.4
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -3.237388e-07 0.124796 0.07939448 NaN Inf 0.6855444 -0.005000474
plot(forecast(arima_model))
The non-zero mean indicates that there’s a long-term average effect that the model accounts for, which is important for return series that often exhibit such characteristics. The low error measures and near-zero first autocorrelation of errors suggest that the model’s predictions should be reliable and that the residuals are mostly random, which is a desirable property.
ggplot(stocks, aes(x = date, y = return, color = ticker)) +
geom_line() +
labs(x = "Date", y = "Return", title = "Return Over Date for different Tickers") +
theme_minimal()
set.seed(42)
random_tickers <- sample(unique(stocks$ticker), 3)
subset_data <- stocks %>%
filter(ticker %in% random_tickers)
# Plot
ggplot(subset_data, aes(x = date, y = return, color = ticker)) +
geom_line() +
labs(x = "Date", y = "Return", title = "Return Over Date for different tickers")
All three equities have periods of significant volatility, as evidenced by the sharp peaks and troughs. GSS US Equity, in green, shows particularly high volatility, with several spikes surpassing a return of 1.0 and dips below -0.5.
None of the equities show a clear long-term trend upwards or downwards; they all seem to fluctuate around a return of zero. This could imply that their prices have oscillated around a mean value, or that the returns have been adjusted to remove any trend.
There are several notable outliers, particularly for GSS US Equity, which may represent specific events that caused drastic increases or decreases in return on those dates.
Machine Learning
Label encoding ticker
stocks$ticker <- as.numeric(factor(stocks$ticker))
# Adding month and year variables
stocks$month <- month(stocks$date)
stocks$year <- year(stocks$date)
head(stocks)
| ticker | date | price | market_cap | price_to_book | debt_to_equity | profitability | volatility | revenue | return | month | year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1995-12-31 | 0.5048 | 35.1440 | 2.5948 | 85.9073 | 0.8628 | 71.728 | 14.720 | -0.0980883 | 12 | 1995 |
| 1 | 1996-01-31 | 0.4719 | 32.8520 | 2.4256 | 85.9073 | 3.0722 | 63.087 | 67.346 | -0.0651743 | 1 | 1996 |
| 1 | 1996-02-29 | 0.5048 | 35.1440 | 2.5948 | 85.9073 | 3.0722 | 97.639 | 67.346 | 0.0697182 | 2 | 1996 |
| 1 | 1996-03-31 | 0.4170 | 29.0367 | 2.0805 | 65.1878 | 3.1180 | 100.450 | 13.438 | -0.1739303 | 3 | 1996 |
| 1 | 1996-04-30 | 0.3841 | 26.7444 | 1.9162 | 65.1878 | 3.1180 | 76.133 | 13.438 | -0.0788969 | 4 | 1996 |
| 1 | 1996-05-31 | 0.3951 | 27.5445 | 1.9710 | 65.1878 | 3.1180 | 88.304 | 13.438 | 0.0286384 | 5 | 1996 |
Dataset split into train and test
# Now I will split the data set into train and test
threshold_date <- max(stocks$date) - 365
df_train <- subset(stocks, date < threshold_date)
df_test <- subset(stocks, date >= threshold_date)
x_train <- df_train[, !names(df_train) %in% c("return", "date")]
y_train <- df_train$return
x_test <- df_test[, !names(df_test) %in% c("return", "date")]
y_test <- df_test$return
Feature engineering
# Apply z-score normalization by ticker for training data
x_train_scaled <- x_train %>%
group_by(ticker) %>%
mutate(across(.cols = everything(), .fns = z_score_normalization)) %>%
ungroup()
# Apply z-score normalization by ticker for testing data
x_test_scaled <- x_test %>%
group_by(ticker) %>%
mutate(across(.cols = everything(), .fns = z_score_normalization)) %>%
ungroup()
Modelling
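One caveat before modelling: the feature engineering step above normalizes the test split with its own per-ticker means and standard deviations, so information from the test period leaks into its features. A leakage-free variant would estimate the statistics on the training split only and reuse them on both splits — a sketch with a hypothetical helper, not what the notebook does above:

```r
library(dplyr)

# Hypothetical leakage-free scaler: per-ticker means/sds are estimated on
# the training split only, then applied to both splits.
scale_with_train_stats <- function(train, test, cols) {
  stats <- train %>%
    group_by(ticker) %>%
    summarise(across(all_of(cols),
                     list(mu = ~mean(.x, na.rm = TRUE),
                          sd = ~sd(.x, na.rm = TRUE))),
              .groups = "drop")
  apply_stats <- function(df) {
    df %>%
      left_join(stats, by = "ticker") %>%
      mutate(across(all_of(cols),
                    ~ (.x - get(paste0(cur_column(), "_mu"))) /
                      get(paste0(cur_column(), "_sd")))) %>%
      select(all_of(names(df)))
  }
  list(train = apply_stats(train), test = apply_stats(test))
}
```

For the split above this would be called as `scale_with_train_stats(x_train, x_test, setdiff(names(x_train), "ticker"))`; test-period observations then never influence the scaling.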
First model
Setting Up Parameters to train
train_params <- list(objective = "regression",
boosting_type = "gbdt",
boost_from_average = "false",
learning_rate = 0.005, # Learning rate
num_leaves = 192, # Max nb leaves in tree
min_data_in_leaf = 100,
feature_fraction = 0.3, # % of features
bagging_freq = 1,
bagging_fraction = 0.7, # % of observations
lambda_l1 = 0,
lambda_l2 = 0)
Training the model
bst_k <- lightgbm(
data = as.matrix(x_train_scaled),
label = y_train, # Target / label
params = train_params, # Passing parameter values
nrounds = 2000 , # Number of boosting rounds
eval_freq = 20, # Evaluate the model every 20 rounds
verbose = -1 # Disable verbose output
)
Warning in .get_default_num_threads(): Optional package 'RhpcBLASctl' not
found. Detection of CPU cores might not be accurate.
# Calculate predictions
predictions <- predict(bst_k, as.matrix(x_test_scaled))
# Calculate RMSE
rmse <- sqrt(mean((predictions - y_test)^2))
# Print RMSE
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
Root Mean Squared Error (RMSE): 0.1171035
Let’s use that last model’s parameter set and perform cross-validation.
Cross validation
cv_model <- lgb.cv(
params = train_params,
data = as.matrix(x_train_scaled),
label = y_train, # Target / label
eval_freq = 80,
nrounds = 3,
nfold = 5
)
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002948 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2133
[LightGBM] [Info] Number of data points in the train set: 211932, number of used features: 10
[1]: valid's l2:0.0157944+0.000533276
[3]: valid's l2:0.0157841+0.000533216
cv_model$record_evals
$start_iter
[1] 1
$valid
$valid$l2
$valid$l2$eval
$valid$l2$eval[[1]]
[1] 0.01579436
$valid$l2$eval[[2]]
[1] 0.01579027
$valid$l2$eval[[3]]
[1] 0.01578412
$valid$l2$eval_err
$valid$l2$eval_err[[1]]
[1] 0.0005332762
$valid$l2$eval_err[[2]]
[1] 0.0005333652
$valid$l2$eval_err[[3]]
[1] 0.0005332158
Lower values of the l2 loss indicate better model performance in minimizing the squared differences between predicted and actual values. A smaller l2 evaluation error indicates less variability in the l2 loss across folds.
The model’s performance, as indicated by the l2 loss, is consistent across folds, with relatively low variability in the evaluation error. This consistency indicates stable predictive performance.
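Since the cross-validation metric is l2 (mean squared error) while the hold-out metric reported earlier is an RMSE, taking square roots puts the two on the same scale:

```r
# l2 values copied from the cv_model output above
cv_l2 <- c(0.01579436, 0.01579027, 0.01578412)
cv_rmse <- sqrt(cv_l2)
round(cv_rmse, 4)  # roughly 0.126, the same ballpark as the 0.117 test RMSE
```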
Hyper parameter tuning
We set up the first combinations
num_leaves <- c(5,30)
learning_rate <- c(0.01, 0.05, 0.2)
pars <- expand.grid(num_leaves, learning_rate)
num_leaves <- pars[,1]
learning_rate <- pars[,2]
Next, the training function. Note that some parameters are flexible, others are fixed.
train_func <- function(num_leaves, learning_rate, x_train_scaled){
train_params <- list( # First, the list of params
num_leaves = num_leaves, # Max nb leaves in tree
learning_rate = learning_rate, # Learning rate
objective = "regression", # Loss function
max_depth = 3, # Max depth of trees
min_data_in_leaf = 50, # Nb points in leaf
bagging_fraction = 0.5, # % of observations
feature_fraction = 0.7, # % of features
nthread = 4, # Parallelization
force_row_wise = T
)
# Next we train
bst <- lightgbm(
data = as.matrix(x_train_scaled),
label = y_train, # Target / label
params = train_params, # Passing parameter values
eval_freq = 50,
nrounds = 1000,
verbose = -1
)
# Next, we record the final loss (depends on the model/loss defined above)
return(loss = bst$record_evals$train$binary_logloss$eval[[10]])
}
train_func(10, 0.1, x_train_scaled) # Testing
Warning in .get_default_num_threads(): Optional package 'RhpcBLASctl' not
found. Detection of CPU cores might not be accurate.
NULL
This returns NULL because a regression model has no binary_logloss entry in record_evals to read, so the function is rewritten below to compute the RMSE directly from predictions.
train_func <- function(num_leaves, learning_rate, x_train_scaled, y_train) {
train_params <- list(
num_leaves = num_leaves,
learning_rate = learning_rate,
objective = "regression",
max_depth = 3,
min_data_in_leaf = 50,
bagging_fraction = 0.5,
feature_fraction = 0.7,
nthread = 4,
force_row_wise = TRUE
)
bst <- lightgbm(
data = as.matrix(x_train_scaled),
label = y_train,
params = train_params,
eval_freq = 50,
nrounds = 1000,
verbose = -1
)
# Calculate Mean Squared Error (MSE)
mse_loss <- mean((y_train - predict(bst, as.matrix(x_train_scaled)))^2)
# Convert MSE to RMSE
rmse_loss <- sqrt(mse_loss)
# Return the loss value and the trained model
return(list(loss = rmse_loss, model = bst))
}
# Call the function with appropriate arguments
result <- train_func(10, 0.1, x_train_scaled, y_train)
Warning in .get_default_num_threads(): Optional package 'RhpcBLASctl' not
found. Detection of CPU cores might not be accurate.
print(result)
$loss
[1] 0.1151554
$model
LightGBM Model (1000 trees)
Objective: regression
Fitted to dataset with 10 columns
The RMSE is 0.115, close to that of our first model bst_k.
Finally, we launch a grid search that spans all combinations of the free parameters.
grd <- pmap(list(num_leaves, learning_rate), # Parameter combinations for the grid search
~ train_func(..1, ..2, x_train_scaled, y_train)) # Data and labels are fixed arguments
Warning in .get_default_num_threads(): Optional package 'RhpcBLASctl' not
found. Detection of CPU cores might not be accurate.
grd
[[1]]
[[1]]$loss
[1] 0.1214682
[[1]]$model
LightGBM Model (1000 trees)
Objective: regression
Fitted to dataset with 10 columns
[[2]]
[[2]]$loss
[1] 0.1209835
[[2]]$model
LightGBM Model (1000 trees)
Objective: regression
Fitted to dataset with 10 columns
[[3]]
[[3]]$loss
[1] 0.1180075
[[3]]$model
LightGBM Model (1000 trees)
Objective: regression
Fitted to dataset with 10 columns
[[4]]
[[4]]$loss
[1] 0.1170967
[[4]]$model
LightGBM Model (1000 trees)
Objective: regression
Fitted to dataset with 10 columns
[[5]]
[[5]]$loss
[1] 0.1139586
[[5]]$model
LightGBM Model (1000 trees)
Objective: regression
Fitted to dataset with 10 columns
[[6]]
[[6]]$loss
[1] 0.1129043
[[6]]$model
LightGBM Model (1000 trees)
Objective: regression
Fitted to dataset with 10 columns
gr <- bind_cols(as.data.frame(pars), as_tibble(map_dbl(grd, "loss")))
gr
| Var1 | Var2 | value |
|---|---|---|
| 5 | 0.01 | 0.1214682 |
| 30 | 0.01 | 0.1209835 |
| 5 | 0.05 | 0.1180075 |
| 30 | 0.05 | 0.1170967 |
| 5 | 0.20 | 0.1139586 |
| 30 | 0.20 | 0.1129043 |
The grid search indicates that the best hyperparameter combination is 30 leaves with a learning rate of 0.20, giving a training RMSE of about 0.113.
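Rather than reading the winner off the table, the best combination can be pulled out programmatically — a small sketch with a hypothetical helper (`Var1`/`Var2` are the default column names produced by expand.grid):

```r
# Hypothetical helper: return the grid-search row with the lowest RMSE,
# renamed to readable columns.
best_params <- function(gr) {
  best <- gr[which.min(gr$value), ]
  data.frame(num_leaves = best$Var1,
             learning_rate = best$Var2,
             rmse = best$value)
}
```

Here `best_params(gr)` should recover num_leaves = 30 and learning_rate = 0.20.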
Let’s calculate the predictions again
# Calculate predictions
trained_model <- result$model # model from the earlier single run (10 leaves, lr = 0.1)
predictions <- predict(trained_model, as.matrix(x_test_scaled))
# Calculate RMSE
rmse <- sqrt(mean((predictions - y_test)^2))
# Print RMSE
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
Root Mean Squared Error (RMSE): 0.1159798
Model interpretability
Feature Importances
importance <- lgb.importance(trained_model)
ggplot(importance, aes(x = Gain, y = reorder(Feature, Gain))) +
geom_col(fill = "#22AABB", alpha = 0.7) +
theme_bw() +
theme(axis.title.y = element_blank(),
axis.text.y = element_text(size = 6),
plot.title = element_text(size = 10)) +
coord_cartesian(clip = "off", ylim = c(0, length(importance$Feature))) +
labs(title = "Feature Importance") +
theme(plot.margin = margin(10, 30, 10, 10))
Conclusion
In conclusion, the year and month variables are clearly important, which fits the financial nature of the data and its temporal dependency. Volatility also shows a strong association with fluctuations in returns, and the stock’s identity (its ticker) and market capitalization exert substantial influence on the predictions.
However, given the high correlation among certain variables, their individual importance diminishes because much of their information is already captured by other features. Moving forward, more advanced models such as Neural Prophet could offer promising avenues for further analysis and predictive modeling, potentially yielding deeper insights into the dataset.